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Abstract 



Very large datasets are often encountered in climatology, either from a multiplicity of 
observations over time and space or outputs from deterministic models (sometimes in 
petabytes= 1 million gigabytes). Loading a large data vector and sorting it, is impossible 
sometimes due to memory limitations or computing power. We show that a proposed 
algorithm to approximating the median, "the median of the median" performs poorly. 
Instead we develop an algorithm to approximate quantiles of very large datasets which 
works by partitioning the data or use existing partitions (possibly of non-equal size). We 
show the deterministic precision of this algorithm and how it can be adjusted to get 

-y ■ customized precisions. 
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^ '. 1 Introduction 

m 

This paper develops an algorithm for approximating the quantiles in petascale 
(petabyte= one million gigabytes) datasets and uses the "probability loss function" 
to assess the quality of the approximation. The need for such an approximation 
does not arise for the sample average, another common data summary. That is 
because if we break down the data to equal partitions and calculate the mean for 
every partition, the mean of the obtained means is equal to the total mean. It is 
^ also easy to recover the total mean from the means of unequal partitions if their 

length is known. 

However computer memories, several gigabytes (GBs) in size, cannot handle 
large datasets that can be petabytes (PBs) in size. For example, a laptop with 2 GBs 
of memory, using the well-known R package, could find the median of a data file of 
about 150 megabytes (MBs) in size. However, it crashed for files larger than this. 
Since large datasets are commonly assembled in blocks, say by day or by district, 
that need not be a serious limitation except insofar as the quantiles computed in that 
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way cannot be used to find the overall quantile. Nor would it help to sub-sample 
these blocks, unless these (possibly dependent) sub-samples could be combined into 
a grand sub-sample whose quantile could be computed. That will not usually be 
possible in practice. The algorithm proposed here is a "worst-case" algorithm in the 
sense that no matter how the data are arranged, we will reach the desired precision. 
This is of course not true if we sample from the data because there is a (perhaps 
small) probability that the approximation could be poor. 
We also address the following question: 

Question: If we partition the data-file into a number of sub-files and 
compute the medians of these, is the median of the medians a good ap- 
proximation to the median of the data-file? 

We first show that the median of the medians does not approximate the exact 
median well in general, even after imposing conditions on the number of partitions 
or their length. However for our proposed algorithm, we show how the partitioning 
idea can be employed differently to get good approximations. "Coarsening" is intro- 
duced to summarize data vector with the purpose of inferring about the quantiles 
of the original vector using the summaries. Then the "d-coarsening" quantile algo- 
rithm which works by partitioning the data (or use previously defined partitions) 
to possibly non-equal partitions, summarizing them using coarsening and inferring 
about the quantiles of the original data vector using the summaries. Then we show 
the deterministic accuracy of the algorithm in Theorem 16.11 The accuracy is mea- 
sured in terms of the probab ility loss func t ion of the original data vector. This is 



an extension of the work in lAlsabti et al.l [1997] to non-equal size partition case 



Theorem 16.11 still requires the partition sizes to be divisible by d the coarsening 
factor. In order to extend the results further to the case where the partitions are 
not divisible by d, we investigate how quantiles of a data vector with missing data 
or contaminated data relate to the quantiles of the original data in Lemma 16.21 and 
Lemma 16.31 Also in Lemma 17.11 we show if the quantiles of a coarsened vector are 
used in place of the quantiles of the original data vector how much accuracy will be 
lost. Finally we investigate the performance of the algorithm using both simulations 
and real climate datasets. 

We define the loss of estimating/approximating a quantile q by q to be the 
probability that the random variable falls in between the two values. A limited ver- 
sion of this concept only for data vectors can be found in computer science literature, 
where e-ap proximations are use d to approximate quantiles of large datasets. (See 



for example iManku et al.l 1998].) However, this concept has not been introduced as 



a measure of loss and the definition is limited to data vectors rather than arbitrary 
distributions. 

The traditional definition of quantiles for a random variable X with distribu- 
tion function F, 

Iqxip) = mi{x\F(x) > p}, 

appears in classic w orks aslParzenl |1979| . We call this the "left quantile function". 
In some books (e.g. iRychlikl 20011 ]) the quantile is defined as 



r Qx(p) = sup{x\F(x) < p}, 

this is what we call the "right quantile function". Also in robustness literature 
people talk about the upp e r and lower medians which are a very specific case of 
these definitions. iHosseinil 2009| considers both definitions, explore their relation 
and show that considering both has several advantages. 

Lemma 1.1: (Quantile Properties Lemma) Suppose X is a random variable on the 
probability space (fi, S, P) with distribution function F: 



F{lq F {p)) > p. 

lq F {p) < rq F (p). 

Pi < P2 => rq F (p 1 ) < lq F {p2)- 

r qF{p) — sup{x|F(x) < p}. 

P(lq F (p) < X < rq F {p)) = 0. i.e. F is flat in the interval (lq F (p),rq F (p)). 

P(X < rq F (p)) < p. 

If lq F {p) < rq F (p) then F(lq F {p)) = p and hence P(X > rq F {p)) = 1 — p. 

lq F {l) > -oo,rq F {0) < oo and P{rq F {0) <X< lq F {l)) = 1. 

lq F (p) and rq F (p) are non- decreasing functions of p. 

If P(X = x)>0 then lq F (F(x)) = x. 

x < lq F {p) =>■ F(x) < p and x > rq F {p) =^- F(x) > p. 
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Finding quantiles and using them to summarize data is of great importance in many 
fields. One example is the climate studies where we have very large datasets. For 
example the datasets created by computer climate models are larger than PBs in 
size. In NCAR (National Center for Atmospheric sciences at Boulder, Colorado), 
the climate data (outputs of compute models) are saved on several disks. To access 
different parts of these data a robot needs to change disks form a very large storage 
space. Another case where we confront large datasets is in dealing with data streams 
which arise in many different applications such as finance and high-speed network- 
ing. For many applications, approximate answers suffice. In computer science, 
quantiles are important to both data base implementers and data base users. They 
can also be used by business intelligence applications to drive summary information 
from huge datasets. 



As pointed out in iManku et al. 
rithm should 



1998], a good quantile approximation algo- 



1. not require prior knowledge of the arrival or value distribution of its inputs. 

2. provide explicit and tunable approximation guarantees. 

3. compute results in a single pass. 

4. produce multiple quantiles at no extra cost. 

5. use as little memory as possible. 

6. be simple to code and understand. 

Finding quantiles of data vectors and sorting them are parallel problems since once 
we sort a vector finding any given quantile can be done in stantly. A g ood ac- 
count of early work in sor ting algorithms can be found in iKnuthl 1973J . Also 
Munro and Patersonl jl98Qj | showed for P-pass algorithms (algorithms that scan 
the data P times) Q(N/P) storage locations are necessary and sufficient, where 
N is the length of the dataset. (See Appendix C for the definitions of complex- 
ity functions such as 9.) It is well- known that t h e wor st-ca se complexity of sort- 



ing is nlog 2 n + 0(1) as shown in IManku et al.l 1999J . In IPatersorJ [1997J . Pa- 
terson discusses the progress made in the so-called "selection" problem. He lets 
Vfc (n) be the worst-case minimum number of pairwise comparisons required to find 



the k-th larges t out of n "dis t inct e lements" 
\n/2], 



k 



In Blum and John 1973 



In particular M(n) = Vk(n) for 
it is shown that the lower bound for Vk(n) 
is n + min{fc — 1, n — k} — 1, an achieved upper bound by Blum is 5.43n. Better 
upper bounds have been achieved through the years. The best upper bound so far 
is 2.9423iV and the lower bound is (2 + a)N where a is of order 2~ 40 . 



Yaol 1974J shows that finding approximate median needs Q(N) comparisons 



in deterministic algorithms. Using sampling this can be reduced to 0(4log(5 x )) 
independent of N, where e is the accuracy of the approxima tion in terms of the 



'probability loss" in our notation. iMunro and Patersonl 1980J show that 0(N l l p ) 



is necessary and sufficient to find an exact 0-quantile in p passes. 

Often an exact quantile is not needed. A related problem is finding space- 
efficient one-pass algorithms to find approximate quantiles. A summary of th e work 
done in this subject and a new method is given in lAgrawal and Swamil 19951 ] . Two 
approxi mate quantile algori t hms u sing only a constant amount o f memory were 



given in iJain and Chlamtad |1985| and lAgrawal and Swamil |1995| . No guarantee 



for the error was given. lAlsabti et al.l |1997| provide an algorithm and guaranteed 



error in one pass. This algorithm works by partitioning the data into subsets, sum- 
marizing each partition and then finding the final quantiles using the summarized 
partitions. The algorithm in this chapter is an extension of this algorithm to the 
case of partitions of unequal length. 



3 The median of the medians 



A proposed algorithm to approximate the median of a very large data vector parti- 
tions the data into subsets of equal length, computes the median for each partition 
and then computes the median of the medians. For example, suppose n = Im and 
break the data to m vectors of size /. One might conjecture that by picking I or 
m sufficiently large the median of the medians would ensure close proximity to the 
exact median. We show by an example that taking I and m very large will not help 
to get close to the exact median. Let / = 26 + 1 and m = 2a + 1. 
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partition number 




Partition 




Median of the partition 


1 


(1,2, 


••■ ,6,6+1, 10 b , ■• • 


,10") 


6+1 


2 


(1,2, 


••■ ,6,6 + 1, 10 fc , ■ • • 


,10") 


6+1 


a 


(1,2, 


••■ ,6,6 + l,10 fc ,-- • 


,10") 


6+1 


a+1 


(1,2, 


••■ ,6,6+ l,10 fc ,-- - 


,10") 


10" 


a+2 




(10",10",--- ,10") 




10" 


2a+l 




(10",10",--- ,10") 




10" 



Tab. 1: The table of data 



Table [Q shows the dataset partitioned into m = 2a + 1 vectors of equal length. 
Every vector is of length Z = 26 + 1. The first a + 1 vectors are identical and 10 fe is 
repeated 6 times in them. The last a vectors are also identical with all components 
equal to 10 6 . The median of the medians turns out to be 6+ 1. However, the median 
of the dataset is 10 b . We show that 6 + 1 is in fact "almost" the first quantile. This 
is because (6 + 1) is smaller than all 10 6, s. There are (a + l)6 + a(26 + 1) data points 
equal to 10 b . Hence 6 + 1 is smaller than this fraction of the data points: 



(a + l)6 + a(26+l) 2a + 2 b a ,113 

H wlx- + -«-. 



(2a+l)(26+l) 2a + 146 + 2 2a + 1 4 2 4 

With a similar argument, we can show that 6 + 1 is greater than almost a quarter 
of the data points (the ones equal to 1, 2, • ■ ■ ,6). Hence 6 + 1 is "almost" the first 
quantile. 

One can prove a rigorous version of the the following statement. 

The median of the medians is "almost" between the first and the third quartile. 

We only give a heuristic argument for simplicity. To that end, let n = Im 
and m = 2a + 1 and I = 26 + 1. Let M be the exact median and M' be the median 



of the medians. Order the obtained medians of each partition and denote them 
by Mi, • • • , M m . By definition W > Mj, j < a and M' < M 6 , j > a + 1. Each 
Mj, j < a is less than or equal to b data points in its partition. Hence, we conclude 
that M' is less than or equal to ab data points. Similarly M' is greater than or 
equal to ab data points (which are disjoint for the data points used before). But 
^n = (2a+i)(Wi) ~ I' Hence, M' is greater than or equal to 1/4 data points and less 
than or equal to 1/4 data points. 

4 Preliminary results 

Suppose y' e {y±, ■ ■ ■ ,y n }, for future reference, we define some additional notations 
for data vectors. 

Definition 4.1: The minimal index of y', m(y') and the maximal index of y', M(y') 
are defined as below: 

m(y') = min{% = y'}, M(y') = max{% = y'}. 

It is easy to see that in y = sort(x) = (y±, ■ • • , y n ) all the coordinates between 
m(y') and M(y') are equal to y' . Also note that if y' = Z{ then M(y')— m(y') + l = rrii 
is the multiplicity of Zi. We use the notation m x and M x whenever we want to 
emphasize that they depend on the data vector x. 

Lemma 4.1: Suppose x = (x±, ■ ■ ■ ,x n ), y = sort(x) and z a non-decreasing vector 
of all distinct elements of x. Then 

a) m(z i+1 ) = M(zi) + 1, % = 0, • • • , r - 1. 

b) Suppose is a bijective increasing transformation over M, 

m^(x)(0(^)) =m x (zi), 

and 

M <Kx) (<f>(z i ))=M x (zi), 

for % — 1, • • • ,r. 

Proof a) is straightforward, 
b) Note that 

m x {y') = mm{i\yi = y'} = min{i|0(?/i) = <j)(y')} = m^ x) ((j)(y')). 
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A similar argument works for M x . ■ 

We also define the position and standardized position of an element of a data 
vector. 

Definition 4.2: Let x = (xi, • • • ,x n ) be a vector and y = sort(x) = (2/1, •• • ,y~n). 
Then for y' G {y±, • • • , y n }, we define 

pos x (y') = {m x (y'),m x (y') + 1, • • • , M x (?/')}, 

where pos stands for position. Then we define the standardized position of y' to be 

,m x (y')-l M x (y>)^ 



spos x (y') 



n n 



In the following lemma we show that for every p G spos(y') (and only p G spos(y')), 
we have rq(p) = lq(p) = y' . For example if 1/2 G spos(y') then y' is the (left and 
right) median. 

Lemma 4.2: Suppose x = (x 1 , ■ ■ ■ , x n ), y = sort(x) = (y ± , ■ ■ ■ , y n ) and y' G {y ± , ■ ■ ■ , y n }. 
Then 

p G spos x (y') <^> lq x (p) = rq x (p) = y' . 

Proof Let z = (zi, ■ ■ ■ ,z r ) be the reduced vector with multiplicities mi, • • • , m r . 
Then y' = rrii for some % — 1 , • • • , r. 

case I: If i = 2, • • • , r, then 

m(y') = mi H h mj_i + 1, 

and 

M(y') = mi + ' • ' + m.j. 

case II: If i = 1, then m(y') = 1 and M(y') = m\. 

In any of the above cases for p G ( m ^^~ , — — ) and only p G ( m ^'~ , — — ) 



by definition. ■ 

Now we prove a lemma. It is easy to see that if u G pos(y') then 

u-l u , 

( ,-) C spos{y). 

n n 

We conclude that 

u — 1 u , 

Uuepos(y>){ , -) C spos(y ). 

n n 

In fact spos(y') can possibly have a few points on the edge of the intervals not in 

I I ( u—l u\ 

u uepos(y'){ n i n >- 

Lemma 4.3: Suppose a; is a data vector of length n and y' is an element of this vector. 
Also assume 

y >x h ie I, y < x j: j e J, 

lnJ = (f>, I, Jd {1,2, ••• ,n}. 
Then there exist a p in (^— , 1 — — ) that belongs to spos(y'). In other words 

kip) = r i(p) = y'- 

Proof From the assumption, we conclude that pos{y') includes a number between 
|7|andn— \J\. Let us call it u . Hence ( BiI ^-, — ) C spos(y'). Since |/| < u < n—\J\, 
we conclude that spos(y') intersects with 

u — I u ,\I\ — 1 \J\. 

U|/|<„<n-|J|( , "J C ( , 1 ). 

n n n n 



5 A loss function to assess approximations of quantiles 

Our purpose is to find good approximations to the median and other quantiles. We 
need a method to asses such approximations. We contend that such a method should 
not depend on the scale of the data. In other words it should be invariant under 
monotonic transformations. We define a function S that measures a natural "degree 
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of separation" between data points of a data vector x. For the sake of illustration, 
consider the example sort(x) = (1, 2, 3, 3, 4, 4, 4, 5, 6, 6, 7). Now suppose, we want to 
define the degree of separation of 3,4 and 7 in this example. Since 4 comes right after 
3, we consider their degree of separation to be zero. There are 3 elements between 
4 and 7 so it is appealing to measure their degree of separation as 3 but since the 
degree of separation should be relative, we cab also divide by n = 11, the length of 
the vector, and get: <5(4, 7) = 3/11. We can generalize this idea to get a definition 
for all pairs in M.. With the same example, suppose we want to compute the degree 
of separation between 2.5 and 4.5 that are not members of the data vector. Then 
since there are 5 elements of the data vector between these two values, we define 
their degree of separation as 5/11. More formally, we give the following definition. 

Definition 5.1: Suppose x = (xi, ■ ■ ■ ,x n ), a data vector and z < z' let A x (z,z') = 
{i\z < Xi < z', % — 1, • • • , n}. Then we define 

«;,/) = - |A * MI 



n 

and S x (z,z) = 0, where |A x (,z, z')| is the cardinality of A x (z,z'). We call S x the 
"degree of separation" (DOS) or the "probability loss function" associated with x. 

We then have the following lemma about the properties of 5. 

Lemma 5.1: The degree of separation S x has the following properties: 

a) S x > 0. 

b) y < y' < y" => S x (y,y") > 6 x (y,y'). 

c) S ( j ) ^ x )((f)(z), 4>{z')) = S x (z, z') if is a strictly monotonic transformation. 

d) y = sort(x) and y { < yj => 8 x (y h yj) < (j - i - l)/n. 

Proof 

Both a) and b) are straightforward. To show (c), suppose z < z' and is 
strictly decreasing. (The strictly increasing case is similar.) Then <f>(z') < <f)(z) and 
hence 

A^) (0(,?), 0(0) = {i\<t>{z') < <t>{xi) < <p(z)} = {i\z <Xi< z'} = A x (z,z'). 



11 



Finally d) is true because \A x (yi,yj)\ = \{l\yi < Xi < yj, I — 1, • • • ,n}\ < j — i — 1. 



Remark. The definition and results above can be applied to random vectors S = 
(Xi, • • • , X n ) as well. In that Ss(z, z') is random. 

Loss function for distributions 

We define a degree of separation for distributions which corresponds to the notion 
of "degree of separation" defined for data vectors to measure separation between 
data points. 

Definition 5.2: Suppose X has a distribution function F. Let 

S F ( Z ', z) = S F (z, z) = lim F(u) - F(z') = P(z' < X < z), z > z ', 

and 6p(z, z) — 0, z G R. We also denote this by Sx whenever a random variable 
X with distribution F is specified. We call 5x the "degree of separation" or the 
"probability loss function" associated with X. 

The following lemma is a straightforward consequence of the definition. 

Lemma 5.2: Suppose x = (x±, ■ ■ ■ ,x n ) is a data vector with the empirical distribu- 
tion F n . Then 

6 Fn (z,z') =6 x (z,z'), z,z' eR. 

This lemma implies that to prove a result about the degree of separation of data 
vectors, it suffices to show the result for the degree of separation of random variables. 

Theorem 5.1: Let X, Y be random variables and Fx, Fy, their corresponding distri- 
bution functions. 

a) Assume Y = <j>(X), for a strictly increasing or decreasing function : R — > R. 
Then 5 Fx (z,z') = S FY {<j>{z),<f>{^)), z < z' eR. 
b)5 Fx (z,z')<S Fx (z,z"), z< z'<z". 

c) 5 Fx (z 1 , z 3 ) < 5 Fx (z 1 , z 2 ) + 5 Fx (z 2 , z 3 ) + P(X = z 2 ). 

d) Suppose, p E [0, 1]. Then S Fx (lq Fx (p),rq Fx (p)) = 0. 

e) Suppose, p x < p 2 E [0, 1]. Then 5 Fx (lq Fx (pi),rq Fx (p 2 )) < p 2 - p ± . 
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Remark. We may restate Part (c), for data vectors: Suppose x has length n and 
z 2 is of multiplicity m, (which can be zero). Then the inequality in (c) is equivalent 
to S x (zu z 3 ) < 8 x {z u z 2 ) + 5 x (z 2 , z 3 ) + m/n. 

Proof 

a) Note that for a strictly increasing function 0, we have 

P(z < X < z) = P{(j){z) < <f>(X) < 4>{z')). 

Now suppose <p is strictly decreasing. Then z < z' =>- <f>(z') < <f)(z). Let Y = <f>(X). 
Then 

5 x (z,z') = P(z<X< z') = P(<j){z') < 0(X) < <f>{z)) = 6y(<f>(z),<f>{zf)). 

b) This is trivial. 

c) Consider the case z\ < z 2 < z%. (The other cases are easier to show.) Then 



5 Fx ( Zl , z 3 ) = P( Zl <X<z 3 ) = P( Zl <X<z 2 ) + P(X = z 2 ) + P{z 2 <X<z z ) 

= S Fx (z 1 ,z 2 ) + 5 Fx (z 2 ,z 3 ) + P{X = z 2 ). 

d) This result is a straightforward consequence of Lemma [1.11 b) and c). 

e) This result follows from 

h x {lq{pi),rq{p 2 )) = P(k(pi) <X < rq(p 2 )) 

= P(X < rq(p 2 )) - P(X < lq( Pl )) < p 2 - Pl . 
The last inequality being a result of Lemma [01 a) and d). 

Remark: (e),(b) immediately imply 

$F x (lqF x (pi),lqF x (P2)) < P2-P1, 

and 

S Fx (rq Fx (pi),lqF x (p2)) < P2 - Pi- 

Remark. We call part c) of the above theorem the pseudo-triangle inequality. 
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6 Data coarsening and quantile approximation algorithm 

This section introduces an algorithm to approximate quantiles in very large data 
vectors. As we demonstrated in the previous section the median of medians algo- 
rithm is not necessarily a good approximation to the exact median of a data vector 
even if we have a large number of partitions and large length of the partitions. The 
algorithm is based on the idea of "data coarsening" which we will discuss shortly. 
The proposed algorithm can give us approximations to the exact quantile of known 
precisions in terms of degree of separation. After stating the algorithm, we prove 
some theorems that give us the precision of the algorithm. The results hold for 
partitions of non-equal length. 

Definition 6.1: Suppose a data vector x of length n = n\n 2 is given, ni,n 2 > 1 E N. 
Also let sort(x) = y = (yi, ■ ■ ■ ,y n )- Then the n 2 -coarsening of x, C n2 (x) is defined 
tobe(y n2 ,y 2n2 ,--- ,2/(ni-i)n 2 )- Note that C n2 (x) has length m-1. Letp; = i/n^i = 
1, 2, • • • , (m - 1). Then C n2 {x) = (lq x (pi), ■■■ , /g x (Pm-i))- 

We can immediately generalize the coarsening operator. Suppose 

sort(x) = (j/!,... ,y n ), 

and n 2 < n is given. Then by The Quotient-Remainder Theorem from elementary 
number theory, there exist ri\ G N U {0} and r < n 2 such that n = ri\n 2 + r. 
Define C„ 2 (x) = (y n2 , ■ ■ ■ ,y n2 (n 1 -i))- The expression is similar to before. However, 
there are n 2 + r elements after y n2 i ni -i) in the sorted vector y. In this sense this 
coarsening is not fully symmetric. We show that if n 2 is small compared to n this 
lack of symmetry has a small effect on the approximation of quantiles. 

Suppose x is a data vector of length n = YllLi h- We introduce the coarsening 
algorithm to find approximations to the large data vectors. 



rf-Coarsening quantiles algorithm: 

1. Partition x into vectors of length Z 1; • • • , l m . (Or use pre-existing partitions, 
e.g. partitions of data saved in various files on the hard disk of a computer.) 



x — \ x li ' ' ' j x h)i x — (^h+l) ' ' ' i x h+h)i ' ' ' ' x ~ ( 3 -V'"" 1 L+l) ' ' ' 5 x n) 
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2. Sort each x l , I = 1, 2, • • • , m and let y l = sort(x l ), I — 1, • • • , m: 

y 1 = (yl---,yl 1 ),---,y m = (y?,---,yZ)- 

3. ri-Coarsen every vector: 

\y<i-> ■ ■ ■ ' ?/(ci-i)d)> ■ ■ ■ Ayd ■>' ' ' ' ^(cm-i)^)' 

and for simplicity drop d and use the notation w\ = y\ d . 

4. Stack all the above vectors into a single vector and call it w. Find rq w (p) (or 
lq w (p)) and call it /i. Then // is our approximation to rq x {p) (or lq x {p)). 

Theorem 6.1: Suppose x is of length n = Yl^Lih, ttl > 2 and U = C{d. Let C = 
YllLi c i- Apply the coarsening algorithm to x and find \x to approximate rq x (p) (or 
lq x (p))- Then fx is a (left and right) quantile in the interval 

[p-e,p + e], 

where e = ^r^- In other words S x (/i,rq x (p)) < e and 5 x (/j,, lq x {p)) < e. When 
^ = erf j = 1, . . . m , e = 2i±i i < 3 _ 

1 ' ' ' m— 1 c— 1 — c— 1 

We need an elementary lemma in the proof of this theorem. 

Lemma 6.1: (Two interval distance lemma) 

Suppose two intervals / = [a, b] and J = [c, d] subsets of R are given. Then 

sup{|p — q\,p & I,q E J} — max{|a — d\,\b — c\}. 



Proof sup{|p — q\,p G /, q G J} > max{|a — d\, \b — c\} is trivial because o, b G / 

and c,d £ J. 

To show the converse note that |p — g| = p — q or q — p, p G /, g G J. But 

p — g < 6 — c, 
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and 

q — p < d — a. 

Hence 

\p — q\ < max{6 — c, d — a} < max{\b — c\, \a — d\}. 

This completes the proof. ■ 

Proof of Theorem 16.11 

Let n! = J2^Li( c i ~ 1) = Yn=i c i ~ m = C - m and M c = {{i,j)\i = 
1, 2 ■ • • , m,j — 1, • • • , Cj — 1}, the index set of u\ Also let c = max{ci, • ■ ■ , c m }. 

Suppose, ^r < P < A, h — 1,- ■ • ,nf. Then since /z = rq w (p), there are 
disjoint subsets of Mc, i^ and i^' such that \K\ = h, \K'\ = n'—h, fi > Wj, (i,j) G K 
and /i < Wj, (i,j) E K'. (This is because if we let v = sort(w), rq w (p) = Vh since 
[n'p] = h — 1.) 

K, K' are not necessarily unique because of possible repetitions among the 
w\. Hence we impose another condition on K and K' . If (i, t) E K then (i, u) ^ 
K', u <t. It is always possible to arrange for this condition. For suppose, (i, t) E K 
and (i,u) E K', u <t. Then fi > w\ and /i < w l u , hence w\ < wf. But since u < t 
we have w\ < wf by the definition of w % . We conclude that w\ — wf. Now we can 
simply exchange (i,t) and (i,u) between K and K'. If we continue this procedure 
after finite number of steps we will get K and K' with the desired property. 

Now define 

tf 1 = {(i, 1)1(1,1)6*}, 

with \K\ | = /ci and 

/i = {(i,j)|j<d,(i,l)e*}, 

Then |/i| = fci<i. Also note that if (i,j) E I±, fi > w\ > y^. 

• Let 

* 2 = {(i,2)|,(i,2)6*}, 

with \K 2 1 = k 2 and 

I 2 = {(i,j)\d<j<2d,(i,2)EK}. 
Then |/ 2 | = k 2 d. Also note that if (i, j) E I 2 , /i > w 2 > y l y 
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• Let 

K t = {(i,t)\(i,t)eK}, 

with \K t \ = k t and 

I t = {(i,j)\(t-l)d<j<td,(i,t)eK}. 
Then \I t \ = k t d. Also note that if (i,j) G It, n > w\ > y*-. 

• Let 

K^ 1 = {(i,(c-l))\(i,c-l) E K}, 

with |-K" c -i| = fc c -i an d 

/ (c -i) = {(i,j)l(c-2)d<j<(c-l)d,(i,c-l)6iir}. 
Then |/ c -i| — k c -id. Also note that if (i,j) G /( c -i), ^ > tu| c _^ > yj. 

Note that K = U%~lK t , \K\ = hi, + • • • + k c -±. Since the lf t are disjoint the 
I t are also disjoint. Let I = ^ c t z\l t then |/| = d{k\ + • • • + fc c -i) = d|if |. Also note 
that (i,j) E I => /J, > y^. 

Similarly define, 

• 

K[ = {(iA)\(i,l)eK'},\K[\ = k' 1 , 

and 

l' 1 = {(i,j)\d<j<2d,(i,i)eK'}. 

Then |/(| = k[d. Also note that if (i,j) G I[, \i < w{ < y). 

• Let 

K' 2 = {{i,2)\{i,2)eK%\K' 2 \ = h> 2 , 

and 

I' 2 = {{i,j)\2d < j <?>d,{i,2) e K'}. 

Then l^l = k' 2 d. Also note that if (i,j) G I 2 , jx < w 2 < y]. 
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• Let 

Kl = {(i,t)\(i,t)eK'},\Kl\ = k% 

and 

l' t = {(i,j)\td<j<{t + l)d,{i,t)eK'}. 

Then \I' t \ = k[d. Also note that if (i, j) E V t then yt, <w\< yj. 

<-l = {(»', (C- l))|(i,C- 1) G X'}, |^_ll = fcU 

and 

Then |/,Lil = ^c-i^- Also note that ^ (*?i) e ^c-i ^ A 4 < ^fc-i) — 2/j- 

Then |/| = |A"|d and |/'| = |iT'|d. We claim that / n I' = 0. To see this note that 
because of how the second components in I t and I[ are defined, it is only possible 
that I t+ i = {{i,j)\td < j < (t+ l)d,{i,t+ 1) E K} and I[ = {{i,j)\td < j < 
(t + l)d, (i, t) E K'} intersect for some t = 1, • • • , c — 2. But if they intersect then 
there exist i, t such that (i,t+l) E K and (i, t) E K' which is against our assumption 
regarding K and K' . Hence by Lemma [4.31 /i is a quantile between 

r |-K"|d n—\K'\d, hd n—(n' — h)d } h m + h 



n n YJILi ^ YnLi c i d c" c 



But we know that 

r h-1 h 



V C — m C — m 

We are dealing with two interval in one of them /i is a quantile and the other contains 
p. 

We showed in Lemma 16.11 if two intervals [a, b] and [c, d] are given, the sup 
distance between two elements of the two intervals is 

max{|a — d\, \b — c\}. 
Applying this to the above two intervals we get, 

rn + h h — 1 h — 1 h 
C C — m C — m C 
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which is equal to, 



mC — m 2 — hm + C C — hm 
max{l CjC^m) 1' l C(C^n) l} - 



But m 2 + hm <m 2 + (C — m)m = mC . Hence 



. mC — m 2 — hm + C . mC — m 2 — hm + C mC + C m 

< 



C{C - m) ' C(C -m) ~ C(C - m) C-m 

Also 

, C — hm , C + mC m + 1 



1 C(C -my- C(C -m) - C-m 

Hence the max is smaller than e = ^r^ and we conclude that /i is a quantile for p' 
which is at most as far as e to p. 

The case /j = cd is easily obtained by replacing C = mc and noting that 
2*±i <3, m>2. M 

TO— 1 — ' — 

In most applications, usually the data partitions are not divisible by d. For 
example the data might be stored in files of different length with common factors. 
Another situation involves a very large file that is needed to be read in successive 
stages because of memory limitations. Suppose that we need a precision e (in terms 
of degree of separation) and based on that we find an appropriate c and m. Note 
that n might not be divisible by mc. 

First we prove two lemmas. These lemmas show what happens to the quan- 
tiles if we throw away a small portion of the data vector or add some more data to it. 
The first lemma is for a situation that we have thrown away or ignored a small part 
of the data. The second lemma is for a situation that a small part of the data are 
contaminated or includes outliers. In both cases, we show how the quantiles com- 
puted in the "imperfect" vectors correspond to the quantiles of the original vector. 
In both case x stands for the imperfect vector and w is the complete/clean data. 

Lemma 6.2: (Missing data quantile approximation lemma) 

Suppose x = (xi, • ■ • ,x n ), sort(x) = (yi, • • • , y n ) andy' = lq x (p),P £ [0, 1]. Consider 
a vector x* of length n* and let w = stack(x,x*). Then y' = lq w (p'), where p' G 
\p-e,p + e] and e= ^-^. 

Similarly if y' = rq x {p) and pG [0, 1], y' — rq w (p'), where p' G [p — e,p + e] 
and e = - J f-^. 

n+n* 
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Proof We prove the result for lq x only and a similar argument works for rq x . 
Let z = sort(w) then lq z = lq w . For p = 1 the result is easy to see. Otherwise, 
- < P < ~ f° r some i — 0, • • • , n — 1. But then y' = lq x {p) = y{. In the new 
vector 2 since we have added n* elements y' = Zj for some j,i<j<i + n*. Hence 
y' — IQz(^+^*)- From np — 1 < i < np, we conclude 

np — 1 i j i + n* np + n* 
< < — - — < < ' 



Hence, 



n + n* n + n* n + n* n + n* n + n* 



n (1 — p) — 1 ? n (1 — p) 
— — < — p< — — =► 

n + n* n + n* n + n* 

j | ^ r| n*(l— p) — 1. ,n*(l-p) 

-p| <max{| ■ — 1,| ■ — — |}. 



n + n* n + n* n + n* 



But |^#| < -£r and p*^" 1 ! < max{^± -^} since p ranges in [0, 11. We 

I n+n* I — n+n* < n+n* I — L n+n* ' n+n* ' r ° L ' J 



n+n 

conclude that that 

j i n * 

P\ < 

n + n* n + n* 



Lemma 6.3: (Contaminated data quantile approximation lemma) 

Suppose x = (xi, • • • , x n ), sort(x) = (yi, • • • , y n ) and y' = lq x (p),P G [0, 1]. Consider 

the vector w = {x\,X2, • • • ,£„_„*) then y' = lq w (p'), where p' G [p — e,p + e] and 

n— n* 

Similarly if y' = rq x {p) and p G [0, 1], y' — rq w (p'), where p' G [p — e,p + e] 
and e = -^-r. 

Proof We only show the case for /g x and a similar argument works for rq x . 
Let z = sort{w). Then Zg 2 = /g w . If p = 1 the result is easy to see. Otherwise, 
- < p < — for some i — 0, • • • ,n — 1. But then y' = lq x {p) = Vi- In the new 
vector 2 since we have removed n* elements y' = Zj for some j, i — n* < j < i. 
Hence y' = lq z ( _f „ ). From np — 1 < i < np, we conclude np — 1 — n* < j < np =>- 
np — n* < j < np. Hence 

— n* + n*p j n*p 

< -~P< 



n — n n — n* n — n 

j n* 

I P < • 

n — n* n — n* 
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In the case that the partitions are not divisible by d, we can use the same 
algorithm with generalized coarsening. The error will increase obviously and the 
next two lemmas say by how much. 

Lemma 6.4: Suppose x has length n = Im + r, < r < I and m = cd. To find 
lq x (p), apply the algorithm in the previous theorems to a sub-vector of x of length 
Im. Then the obtained quantile is a quantile for a number in [p — e,p + e], where 

to+1 1 i r 



to— 1 e— 1 lin+r ' 



Proof The result is a straightforward consequence of the Theorem 16.11 and the 
Lemma I 



Lemma 6.5: Suppose x has length n = Y1T=\ h an< ^ ^ = c ^ + r «> r * < ^- Let -^ = 
X]r=i r *- Then apply the algorithm above to x to find lq x (p), using the generalized 
coarsening. The obtained quantile is a quantile for a number in [p — e,p + e] where 

771+1 _|_ R 

fc — C-to ' R+Cd- 

Proof Let l[ = Cid. Consider %' a sub- vector of x consisting of 

Then x' has length Y^Li ^- ^Y Lemma f6.2l p-th quantile found by the algorithm is 
a quantile in [p — ei,p + ei], €\ = ^z^ f° r x '- x nas -R — Sl^i r * elements more than 
x'. Hence the obtained quantile is a quantile for x for a number in [p — e,p + e], 

e = ei + ''' 



R+Cd' 
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Suppose a data vector x has length n. To find the quantiles of this vector, we only 
need to sort x. Since then for any p G (0,1), we can find the first h such that 
p > h/n. Note that 

Tl — 1 

sort(x) = (lq x (l/n), lq x (2/n), ■ ■ • , ^(1)) = (rq x (0),rq x (l/n), ■■■ , rq x ( )). 

n 
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We only focus on left quantiles here. Similar arguments hold for the right quantile. 
Obviously, the longer the vector x, the finer the resulting quantiles are. Now 
imagine that we are given a very long data vector which cannot even be loaded 
on the computer memory. Firstly, sorting this data is a challenge and secondly, 
reporting the whole sorted vector is not feasible. Assume that we are given the 
sorted data vector so that we do not need to sort it. What would be an appropriate 
summary to report as the quantiles? As we noted also the sorted vector itself 
although appropriate, maybe of such length as to make further computation and 
file transfer impossible. The natural alternative would be to coarsen the data vector 
and report the resulting coarsened vector. To be more precise, suppose, length(x) = 
n = riin 2 and y = sort(x) = (yi, ■ ■ ■ , y n ). Then we can report 

V = Cn 2 (y) = (Vn 2 , ■ ■ ■ ,J/(ni-l)na)- 

This corresponds to 

(lq yl (l/n 2 ),--- ihA 1 ))- 
How much will be lost by this coarsening? Suppose, we require the left quantile 
corresponding to (h — l)/n < p < h/n, h = 1, • • • , n. Then x would give us yh- But 
since (h — l)/n < p < h/n 

np < h < np + 1 . 
Also suppose for some h' = 1, • • • , ni, 

(ti - l)/(ni - 1) < p < {h')/{rn - 1) => {ti - 1) < p{n x -\)<ti 

=^ (m - l)p <h'< p(ni - 1) + 1. 

Then 

(h - l)(m - l)/n < ti < h{n x - I) /n + 1, 

and 

(h — l)(ni — \)n 2 /n < h'n 2 < h(ni — l)n 2 /n + n 2 . (1) 

Using the coarsened vector, we would report yh'(n 2 ) as the approximated quantile 
for p. The degree of separation between this element and the exact quantile using 
Equation [T] is less than or equal to 

Ah— (h— l)(Vii — l)n 2 /n\ \h(n\ — l)n 2 /n + n 2 — h\ -. 

maxj , }. 

n n 
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This equals 

— hn 2 — n\n 2 + n 2 —hn 2 + nn 2 



max il ~2 -I - 1 — — ~2 "I!- 

But 



n 2 n 2 



-hn 2 - n 1 n 2 + n 2 n 2 (n 1 + n - I) n 2 (n 1 + n) 1 n 2 



n 2 n* n* n n 



2 


n 2 


-hn 2 + nn 2 ^ n 2 
n 2 n 





■ 



and 



Hence the degree of separation is less than 1/n + l/ni. We have proved the following 
lemma. 

Lemma 7.1: Suppose x is a data vector of the length n = n\n 2 and y = sort(x), 
y' = C n2 (y). Then if we use the quantiles of y' in place of x, the accuracy lost in 
terms of the probability loss of x (5 X ) is less than 1/n + l/n\. 

The algorithm proposes that instead of sorting the whole vector and then 
coarsening it, coarsen partitions of the data. The accuracy of the quantiles obtained 
in this way is given in the theorems of the previous section. This allows us to load 
the data into the memory in stages and avoid program failure due to the length of 
the data vector. We are also interested in the performance of the method in terms of 
speed, and do a simulation study using the "R" package (a well-known software for 
statistical analysis) to assess this. In order to see theoretical results regarding the 



comp lexity of the special case of the algorithm for equal partitions see lAlsabti et al. 



19971 ] . For the simulation study, we create a vector, x, of length n = 10 7 . We apply 
the algorithm for m = 1000, c = 20, d = 500. We create this vector in a loop of 
length 1000. During each iteration of the loop, we generate a random mean for a 
normal distribution by first sampling from N(0, 100). Then we sample 10,000 points 
from a normal distribution with this mean and standard deviation 1. We compare 
two scenarios: 

1. Start by a NULL vector x and in each iteration add the full generated vector 
of length 10000 to x. After the loop has completed its run, sort the data vector 
which now has length 10 7 by the command sort in R and use this to find the 
quantiles. 
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2. Start with a NULL vector w. During each iteration after generating the ran- 
dom vector, d-coarsen the data by d — 500. (Hence m = 1000, c = 20.) In 
order to do that computing, first apply the sort command to the data and then 
simply (i-coarsen the resulting sorted vector. During each iteration, add the 
coarsened vector to w. After all the iterations, sort w and use it to approximate 
quantiles. 



Remark. The first part corresponds to the straightforward quantiles' calculation 
and the second corresponds to our algorithm. Note that in the real examples instead 
of the loop, we could have a list of 1000 data files and still this example serves as a 
way of comparing the straightforward method and our algorithm. 

Remark. Note that if we wanted to create an even longer vector say of length 10 10 
then the first method would not even complete because the computer would run out 
of memory in saving the whole vector x. 

Remark. The final stage of the algorithm can use the fact that w is built of 
ordered vectors to make the algorithm even faster. We will leave that a problem to 
be investigated in the future. 

We have repeated the same procedure for n = 2x 10 7 , m = 1000, d = 500 and 
n = 10 8 ,m = 1000, d = 500. The results of the simulation are given in Table EJ in 
which "DOS" stands for the degree of separation between the exact median and the 
approximated median. The "DOS bound" bounds the degree of separation obtained 
by the theorems in the previous section. For n = 10 7 , n = 2 x 10 7 significant time 
accrue by using the algorithm. For a vector of length 10 8 , R crashed when we tried 
to sort the original vector and only the algorithm could provide results. For all 
cases the exact and approximated quantiles are close. In fact the dos is significantly 
smaller than the dos bound. This is because this is a "worst-case" bound. The exact 
and approximated quantiles for n = 10 7 are plotted in Figure [TJ 
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Fig. 1: Comparing the approximated quantiles to the exact quantiles N = 10 7 . The 
circles are the exact quantiles and the + are the corresponding approximated 
quantiles. 



Length 


n= 10 7 


n = 2 x 10 7 


n = 10 8 


Exact median value 


1.847120 


1.857168 


NA 


Algorithm median value 


1.866882 


1.846463 


1.846027 


DOS 


0.00012 


-6.475 x 10~ 5 


NA 


DOS bound 


0.05268421 


0.02566667 


0.005030151 


Time for exact median 


186 sec 


461 s 


NA 


Time for the algorithm 


6 sec 


18 s 


98 s 



Tab. 2: Comparing the exact method with the proposed algorithm in R run on a 
laptop with 512 MB memory and a processor 1500 MHZ, m = 1000, d = 500. 
"DOS" stands for degree of separation in the original vector. "DOS bound" 
is the theoretical degree of separation obtained by Theorem 16.11 



Next, we apply the algorithm on a real dataset. The dataset includes the 
daily maximum temperature for 25 stations over Alberta during the period 1940- 
2004. We focus on the 95th percentile. The results are given in Table [3j The 
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0.4 0.6 

lq(P) 



Fig. 2: Comparing the approximated quantiles to the exact quantiles for MT (daily 
maximum temperature) over 25 stations in Alberta 1940-2004. The circles 
are the exact quantiles and the + the approximated quantiles. 
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algorithm finds the percentile more quickly but the time difference is not as large 
as the simulation. This is because most of the time of the algorithm and the exact 
computation is spent on reading the files from the hard drive. The dos bound is 
about 0.01 (on the 0-1 probability scale). The true degree of separation is about 
0.001. The estimated quantiles and the exact quantiles are plotted in Figure [2j 
Notice that the exact and approximated values match except at the very beginning 
(very close to zero) and end (when it is close to 1), where we see that the circles 
(corresponding to exact quantiles) and the +s (corresponding to the approximated 
quantiles) do not completely match. This difference is at most 0.01 in terms of dos 
in any case. 



Exact 95th percentile 


27 C 


Algorithm 95th percentile 


26.7 C 


DOS 


0.001278726 


DOS bound 


0.01052189 


time for exact median 


8 min 6 sec 


time for the algorithm 


7 min 29 sec 



Tab. 3: Comparing the exact method with the proposed algorithm in R (run on 
a laptop with 512 MB memory and processor 1500 MHZ) to compute the 
quantiles of MT (daily maximum temperature) over 25 stations with data 
from 1940 to 2004. 

Acknowledgements: I would like to thank Jim Zidek and Nhu Le for insightful 
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